Artificial neural networks optimize the establishment of a Brazilian germplasm core collection of winter squash (Cucurbita moschata D.)

With widespread cultivation, Cucurbita moschata stands out for the carotenoid content of its fruits, such as β- and α-carotene, components with pronounced provitamin A function and antioxidant activity. C. moschata seed oil has a high content of monounsaturated fatty acids and vitamin E, constituting a lipid source of high chemical–nutritional quality. The present study evaluates the agronomic and chemical–nutritional aspects of 91 accessions of C. moschata kept at the Vegetable Germplasm Bank of the Federal University of Viçosa (BGH-UFV) and proposes the establishment of a core collection (CC) based on multivariate approaches and on the implementation of Artificial Neural Networks (ANNs). ANNs were more efficient than the multivariate approaches in identifying similarity patterns and in organizing the distance between the genotypes in the groups. The averages and variances of traits in the CC formed using a 15% sampling of accessions were closest to those of the complete collection, particularly for accumulated degree days for flowering, the mass of seeds per fruit, and seed and oil productivity. Establishing the 15% CC, based on the broad characterization of this germplasm, will be crucial to optimize the evaluation and use of promising accessions from this collection in C. moschata breeding programs, especially for traits of high chemical–nutritional importance such as the carotenoid content and the fatty acid profile.

Table 1. Origin of part of the C. moschata accessions kept in the Vegetable Germplasm Bank of the Federal University of Viçosa. The two letters associated with the accession names refer to the Brazilian states where the accessions were collected, namely Paraná (PR), Santa Catarina (SC), São Paulo (SP), Minas Gerais (MG), Rio de Janeiro (RJ), Espírito Santo (ES), Distrito Federal (DF), Goiás (GO), Rio Grande do Norte (RN), and Bahia (BA). *These genotypes are commercial cultivars widely cultivated in Brazil.

Agro-morphological evaluations
The agro-morphological evaluation was carried out in a field experiment conducted from January to July 2016 at the Experimental Unit of the Department of Agronomy at UFV, "Horta Velha" (20° 45′ 24″ S, 42° 50′ 45″ W; altitude, 648.74 m). The soil in the experimental area is classified as dystrophic Red Yellow Oxisol with a flat topography, and the climate in the region is Cwb, with an average annual temperature of 19.4 °C and annual precipitation of approximately 1200 mm. The evaluation of the genotypes comprised vegetative traits, production, and chemical-nutritional aspects of fruits, seeds, and seed oil. Details about the agro-morphological descriptors used in the evaluation of germplasm are provided in Supplementary Table 2. The genotypes were also evaluated for multi-categorical traits, and details about these traits are provided in Supplementary Table 3. The accessions were evaluated together with four commercial cultivars used as controls: the hybrids Tetsukabuto and Jabras (C. moschata × C. maxima) and the cultivars Jacarezinho and Maranhão. These genotypes were evaluated using Federer's augmented block design 31, with five replications for each control. The four controls were randomly distributed in each block, and the accessions were randomly distributed among all the blocks in equal numbers. The experiment was established using a spacing of 3 × 3 m between plants and rows, with five plants per plot. The production and transplanting of seedlings and the cultural treatments were carried out in accordance with the local recommendations for the crop 32.
The agro-morphological evaluations were carried out on the three central plants of each plot, using three fruits per plant. The carotenoid content was estimated based on the analysis of colorimetric parameters of fruit pulp, using a manual tristimulus colorimeter (Color Reader CR-10; Konica Minolta, Tokyo, Japan). This assessment was performed as detailed by 18, according to the equations proposed by 33, in which C corresponds to the saturation or chroma of fruit pulp, conventionally computed as $C = \sqrt{a^2 + b^2}$; a and b correspond to the contribution of red and yellow to the color of fruit pulp (dimensionless), respectively; TC corresponds to the total content of carotenoids; and L corresponds to the lutein content of fruit pulp, both expressed in μg g⁻¹ of fresh fruit mass.
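As a small illustration of the chroma term, the sketch below computes C from a and b using the standard CIELAB definition, C = sqrt(a² + b²). The colorimeter readings are hypothetical, and the regression equations of 33 that convert colorimetric parameters into TC and L are not part of this sketch.

```python
import math


def chroma(a: float, b: float) -> float:
    """Chroma (saturation) from the a (red) and b (yellow) coordinates."""
    return math.hypot(a, b)


# Hypothetical readings for one pulp sample (not data from the study).
a_value, b_value = 30.0, 40.0
pulp_chroma = chroma(a_value, b_value)
print(f"C = {pulp_chroma:.2f}")
```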
The seed oil content (SOC) was determined using an extractor (ANKOM XT15, ANKOM, Macedon, United States), according to a standard method from the Association of Official Analytical Chemists (AOAC), described by 34. The extraction of seed oil was carried out using mechanical pressing, according to the methodology by 18, and the fatty acid profile was analyzed using gas chromatography (GC). GC was performed using the GC-17A gas chromatograph (Shimadzu Corporation, Kyoto, Japan), equipped with an automatic insertion platform, flame ionization detector, and a Carbowax capillary column (30 m × 0.25 mm). Chromatography was performed under injection and detection temperatures of 230 and 250 °C, respectively. Column operation started at 200 °C, with an increase of 3 °C·min⁻¹, until reaching a temperature of 225 °C. Nitrogen was used as a carrier gas with a flow rate of 1.3 L·min⁻¹, and the concentration of each methyl ester was determined as a percentage of the relative peak area.

Implementation of restricted maximum likelihood procedures and the best linear unbiased prediction for analysis of agro-morphological data
Agro-morphological data were analyzed using restricted maximum likelihood (REML) procedures and the best linear unbiased prediction (BLUP). This analysis was carried out using the R program and the package lme4 35. The genotypic values of accessions (BLUPs) and controls (BLUEs) were obtained from the BLUP procedure, while the estimates of variance components were obtained from REML, based on the following model:
$$y = Wb + Xa + Zt + e$$
where y represents the phenotypic data vector, b represents the vector of block effects, assumed to be random, a represents the vector of accession effects, assumed to be random, t represents the vector of control effects, assumed to be fixed, and e represents the error vector. W, X, and Z represent the incidence matrices relating b, a, and t, respectively, to the data vector y. Both the multivariate and ANN analyses were carried out using the estimated BLUPs and BLUEs.
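For reference, the symbol definitions imply the model y = Wb + Xa + Zt + e. Assuming (this is not stated in the text) that block and accession effects are independent with variances $\sigma_b^2$ and $\sigma_g^2$ and that the residual variance is $\sigma_e^2$, the BLUEs of t and the BLUPs of b and a can be written as the solution of Henderson's mixed model equations:

$$
\begin{bmatrix}
Z'Z & Z'W & Z'X \\
W'Z & W'W + \lambda_b I & W'X \\
X'Z & X'W & X'X + \lambda_g I
\end{bmatrix}
\begin{bmatrix} \hat{t} \\ \hat{b} \\ \hat{a} \end{bmatrix}
=
\begin{bmatrix} Z'y \\ W'y \\ X'y \end{bmatrix},
\qquad
\lambda_b = \frac{\sigma_e^2}{\sigma_b^2}, \quad
\lambda_g = \frac{\sigma_e^2}{\sigma_g^2}
$$

In practice the variance components in $\lambda_b$ and $\lambda_g$ are replaced by their REML estimates.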
The estimates of variance components comprised only the genotypic variance ($\sigma_g^2$). Heritability was obtained based on the following estimator: $h^2 = 1 - (\mathrm{Pev}/\sigma_g^2)$, where Pev represents the prediction error variance 36.
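The estimator can be illustrated with a few lines of Python. This is only a sketch: the numerical values are hypothetical, and averaging the prediction error variances across accessions before applying the formula is an assumption made here for illustration, not a step described in the text.

```python
from statistics import mean


def heritability(pev: float, sigma2_g: float) -> float:
    """h^2 = 1 - Pev / sigma_g^2."""
    if sigma2_g <= 0:
        raise ValueError("genotypic variance must be positive")
    return 1.0 - pev / sigma2_g


# Hypothetical values for one trait: prediction error variances of the
# accession BLUPs and a REML estimate of the genotypic variance.
pev_per_accession = [1.8, 2.1, 2.0, 2.1]
sigma2_g = 8.0

# ASSUMPTION: the mean Pev across accessions is used in the estimator.
h2 = heritability(mean(pev_per_accession), sigma2_g)
print(f"h2 = {h2:.2f}")
```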

Analysis of genetic variability using multivariate approaches and artificial neural networks
Multivariate analysis included the grouping of genotypes and the distribution of accessions in relation to principal components. Multivariate analysis of variability was carried out using both quantitative and multi-categorical information. For quantitative data, the distance matrix between genotypes was obtained using the standardized average Euclidean distance, from the estimates of BLUPs and BLUEs. For multi-categorical data, the distance matrix was obtained from the arithmetic complement of the simple coincidence index. The two matrices were then standardized and summed with equal weights, resulting in a single distance matrix. The grouping method was chosen based on cophenetic correlation, opting for the method that provided the highest cophenetic correlation coefficient, while the number of groups to be formed in clustering was determined using the methodology proposed by 37. Multivariate analyses were performed with the help of Matlab 38 and the Genes software 39. A principal component analysis was implemented in order to identify the distribution of accessions in relation to the principal components. This analysis considered the data of quantitative and multi-categorical traits, according to the methodology of 40, and was implemented with the help of Matlab 38.
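The combination of the two distance matrices can be sketched as follows. This is a minimal illustration, not the Genes or Matlab routine: the data are hypothetical, and because the text does not say how the matrices were standardized, the sketch assumes division by each matrix's largest entry before summing.

```python
import math
from statistics import stdev


def mean_euclidean(quant):
    """Standardized average Euclidean distance between rows of `quant`."""
    n_vars = len(quant[0])
    sds = [stdev(col) for col in zip(*quant)]  # one SD per trait
    z = [[x / s for x, s in zip(row, sds)] for row in quant]
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(zi, zj)) / n_vars)
             for zj in z] for zi in z]


def simple_matching_complement(categ):
    """1 minus the proportion of traits on which two genotypes coincide."""
    n_traits = len(categ[0])
    return [[1 - sum(a == b for a, b in zip(ri, rj)) / n_traits
             for rj in categ] for ri in categ]


def scale_to_unit_max(dist):
    """ASSUMPTION: standardize a matrix by dividing by its largest entry."""
    top = max(max(row) for row in dist)
    return [[d / top for d in row] for row in dist]


def combined_distance(quant, categ):
    """Equal-weight sum of the two standardized distance matrices."""
    dq = scale_to_unit_max(mean_euclidean(quant))
    dc = scale_to_unit_max(simple_matching_complement(categ))
    return [[a + b for a, b in zip(rq, rc)] for rq, rc in zip(dq, dc)]


# Hypothetical genotypic values (two quantitative traits) and
# multi-categorical scores (two traits) for three genotypes.
quant = [[10.0, 1.0], [12.0, 3.0], [20.0, 2.0]]
categ = [["round", "green"], ["round", "orange"], ["flat", "orange"]]
dist = combined_distance(quant, categ)
for row in dist:
    print([round(d, 3) for d in row])
```

The resulting matrix could then be passed to any hierarchical clustering routine, such as UPGMA.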
The organization of the genetic variability was analyzed through neural networks using Kohonen self-organizing maps (SOM). For this, different two-dimensional hexagonal topological maps were tested in which the N units (neurons) were allocated considering the number of rows and columns, each ranging from 1 to 7. This procedure was based on the understanding that defining the topological map and, consequently, the number of neurons and parameters should be based on the researcher's experience and on trial-and-error methods 41. Next, the best network architecture was selected from 2000 training sessions for each of the combinations. The defined network topology had a hexagonal neighborhood. Network analysis was performed with the help of Matlab 38 and the Genes software 39.
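To make the SOM procedure more concrete, the sketch below trains a very small Kohonen map in plain Python. It is an illustration under stated assumptions, not the Matlab or Genes implementation: it uses a rectangular grid with a Gaussian neighborhood instead of the hexagonal topology, linearly decaying learning rate and radius, a single training run rather than a selection among many sessions, and hypothetical data scaled to [0, 1]. All function names and parameters are invented for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinates of the neuron whose weights are closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a rows x cols SOM (rectangular grid, Gaussian neighborhood)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial synaptic weights, one vector per neuron.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        for x in rng.sample(data, len(data)):  # shuffled presentation
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Neighborhood strength falls off with grid distance to BMU.
                h = math.exp(-math.dist(node, bmu) ** 2 / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Hypothetical standardized trait values for four genotypes.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=2, cols=2)
allocation = [best_matching_unit(som, x) for x in data]
print(allocation)
```

Because the initial weights are random, a different seed can give a different allocation, which is why repeating the training many times for each candidate architecture is a sensible precaution.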
Core collections were established by randomly sampling accessions from the full collection at sampling intensities of 10, 15, 20, and 25%. Thus, 9, 14, 18, and 23 accessions were sampled from the full collection to form the core collections with sampling intensities of 10, 15, 20, and 25%, respectively. The sampling of accessions was random and without replacement. The core collections were validated by comparison with the complete collection, based on parameters obtained for the agro-morphological characteristics, such as mean and variance 29. Means and phenotypic variances of variables in the complete collection and core collections were estimated with the aid of the Genes software 39.
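The sampling and validation steps can be sketched in a few lines. The example below is illustrative only: the trait values are simulated rather than taken from the study, the seed is arbitrary, and the relative difference is just one possible way to express how close a core collection is to the complete collection.

```python
import random
from statistics import mean, variance


def core_size(n_accessions, intensity):
    """Number of accessions sampled at a given sampling intensity."""
    return round(n_accessions * intensity)


def sample_core(values, intensity, rng):
    """Random sample without replacement from the full collection."""
    return rng.sample(values, core_size(len(values), intensity))


def relative_difference(core_stat, full_stat):
    """Absolute difference between core and full, relative to full."""
    return abs(core_stat - full_stat) / abs(full_stat)


rng = random.Random(42)
# Hypothetical genotypic values of one trait for 91 accessions.
full = [rng.gauss(20.0, 5.0) for _ in range(91)]

intensities = (0.10, 0.15, 0.20, 0.25)
sizes = [core_size(len(full), p) for p in intensities]
print(sizes)

for p in intensities:
    core = sample_core(full, p, rng)
    d_mean = relative_difference(mean(core), mean(full))
    d_var = relative_difference(variance(core), variance(full))
    print(f"{p:.0%} CC: n={len(core)}, mean diff={d_mean:.3f}, "
          f"variance diff={d_var:.3f}")
```

With 91 accessions, rounding the product of collection size and sampling intensity reproduces the 9, 14, 18, and 23 accessions reported above.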

Collection and use of any plant materials statement
The authors declare that the collection and use of plant material were carried out in accordance with all the relevant guidelines.

Phenotypic range and heritability of traits
Based on the distribution analysis of traits, we observed a wide phenotypic range for fruit production traits and chemical-nutritional aspects of fruit pulp and seed oil (Fig. 1). The ranges were especially wide for the productivity of fruits (PF), the total carotenoid content of fruit pulp (TC), and the oleic and linoleic fatty acid contents. In addition, these traits expressed significant genotypic variances and heritability estimates ranging from high to very high (Fig. 1).
Most of the traits expressed greater amplitude between the accessions compared with the hybrids or lines used as controls.

Clustering of genotypes and principal components analysis from multivariate approach
The unweighted pair-group method using arithmetic averages (UPGMA) provided one of the highest cophenetic correlation indexes (> 0.7) and was adopted for the grouping of genotypes. Analysis of the variability using the multivariate approach showed that accessions and controls were grouped into seven groups. Groups 1 and 2 were the largest groups, consisting of 33 and 37 genotypes, respectively (Fig. 2). Group 7 contained only the BGH-6749 genotype and was the smallest group. Group 7 had the lowest number of accumulated degree days for flowering (DDF), followed by groups 6 and 4. These groups also had the lowest averages for DDF. Group 5 contained the genotype with the highest productivity of fruits (44.67 t ha⁻¹) and was the group with the highest average productivity of fruits (21.74 t ha⁻¹).
Groups 2 and 3 contained the genotypes with the highest average for total carotenoid content in fruit pulp, with contents of 187.21 and 181.17 μg g⁻¹ of fresh mass, respectively. Group 4 contained the genotype with the highest content of oleic fatty acid in the oil (40.18%) and was also the group with the highest average for this characteristic (26.28%). Group 4 also contained the genotype with the lowest linoleic fatty acid content.
Figure 3 demonstrates the distribution of accessions in relation to the first two principal components (PC), emphasizing the analysis of genotype variability based on the multivariate approach. PC analysis highlighted accessions BGH-5456A, BGH-1992, and BGH-291 as those with the highest loads in the first PC. The first PC explained 80.6% of the total variation of genotypes in relation to agro-morphological characteristics, and the second PC explained 16.7%.

Organization of the accessions' variability from ANNs
Figure 4 shows the organization of the variability of accessions obtained from the ANN analysis using SOM. Each neuron concentrated a similar number of accessions and controls, demonstrating an equitable concentration of these genotypes in the neurons (Fig. 4A,B, Table 2). ANNs analysis provided information about the genetic distance between the accessions and controls in each neuron. A tendency for genotypes with greater genetic distance to concentrate in the extreme neurons was observed (Fig. 4C).

Establishment and validation of the core collection
Table 3 shows the list of accessions of each core collection obtained from the different sampling intensities. Validation of the core collections was performed by comparing the mean and variance of the complete collection and the mean and variance of each core collection.
In general, the core collection obtained under a sampling intensity of 15% (15% CC) presented means and variances closest to those of the complete collection. The means and variances for degree days accumulated for flowering (DDF), number of fruits per plant (NFP), mass of seeds per fruit (MSF), productivity of seeds (PS), and seed oil productivity (SOP) in the 15% CC were very close to those of the complete collection (Table 4).

Discussion
The high phenotypic ranges observed in this study for traits such as fruit productivity, total carotenoid content of fruit pulp, and oleic and linoleic fatty acid levels are in line with the genetic variability observed in previous studies of the C. moschata germplasm 16–18,42.
Accessions BGH-5455A and BGH-5598A expressed the highest carotenoid contents with 187.21 and 181.17 μg g⁻¹ of fresh pulp mass, respectively. This result is much higher than those reported in previous studies 2,43. For example, the study involving the characterization of 55 accessions of C. moschata, also maintained by the BGH-UFV, reported a total content of carotenoids in the fruit pulp not exceeding 118.70 μg g⁻¹ of fresh pulp mass 44. On the other hand, when evaluating the C. moschata germplasm from Northeastern Brazil, Carvalho et al. 1 reported averages of up to 404.98 μg g⁻¹. The differences observed in the total content of carotenoids in fruit pulp between the present and previous studies may be mainly associated with the genetic aspects of the germplasm evaluated in each study. Studies with C. moschata generally reported high levels of carotenoids in fruit pulp 1,45, particularly β- and α-carotene. These components are known for their important biological functions, such as provitamin A 46 and antioxidant activity 4.
Accessions BGH-5456A, BGH-3333A, BGH-5361A, and BGH-5472A expressed the highest levels of oleic acid. The emphasis on the analysis of the fatty acid profile of C. moschata aims at exploring the potential of this vegetable as an oleaginous crop. Consisting of approximately 75% unsaturated fatty acids (UFA) and with a high content of monounsaturated fatty acids (MUFA) such as oleic acid 7,8, the oil from C. moschata seeds is an excellent substitute for lipid sources with high levels of saturated fatty acids, which are harmful to human health. Corroborating this, studies demonstrate the association between the consumption of lipid sources composed predominantly of saturated fatty acids and the high risk of cardiometabolic pathologies, particularly cardiovascular diseases and type II diabetes mellitus 47,48. This has encouraged the replacement of saturated lipids in human food with UFA, with a particular focus on vegetable oils, the main source of UFA in the human diet.
Multivariate analyses and ANNs highlighted the high variability of C. moschata accessions. Clustering using a radial dendrogram allowed the identification of groups with the most promising averages in terms of accumulated DDF, PF, TC, and fatty acid profile (Fig. 2). The analysis of variability using PC corroborated the accession grouping pattern using the dendrogram, highlighting accessions BGH-5456A, BGH-1992, and BGH-291 as the most divergent (Fig. 3).
The analysis of the organization of the accessions' variability from ANNs corroborated the variability observed from the multivariate approach. This was confirmed by the concentration of a similar number of accessions along the neurons (Fig. 4A,B). This demonstrates that the adopted network architecture, consisting of seven columns and seven rows, efficiently organized the variability of the genotypes. Similar to the present study, a series of studies with Kohonen SOM also defined their topology randomly or by trial and error 22,49,50. With this, it is assumed that the method to find the best architecture should be established judiciously. This is because different results can be obtained each time a SOM is used, given that networks have random synaptic weights at the beginning of training 22.
Analysis using ANNs identified a tendency for genotypes with greater genetic distance to concentrate in the most extreme neurons (Fig. 4C), information that will support the establishment and validation of the core collections. Thus, genotypes concentrated in the extreme neurons express greater genetic distance. ANNs analysis enabled the organization of the genotypes into closer groups than those obtained from the radial dendrogram grouping (Fig. 2), proving to be more efficient in identifying similarity patterns and in organizing the proximity of genotypes between groups. Similarly, Santos et al. 22 also used the SOM technique as an alternative method to assess genetic diversity in rice breeding programs. However, it should be noted that there is the possibility of greater variation in the allocation of genotypes in neurons as the number of neurons increases 51.
The variability observed among the genotypes of C. moschata in the present study is in line with previous studies with this species, characterized by high genetic variability, reflected, at first, in the variation of morphological aspects of plants and fruits. Studies have highlighted the variability of the Brazilian germplasm of C. moschata 16–18, possibly a result of the adaptation of this germplasm to a wide ecological range found in the country, consisting of different edaphoclimatic conditions 15. In addition, the occurrence of natural hybridization between populations also contributes to the variability in the germplasm of this vegetable 18.
When establishing core collections, they must be evaluated regarding their ability to maintain the existing variability in the complete collection 29,30. The averages and variances of agro-morphological characteristics of 15% CC were closest to the averages and variances of the complete collection, particularly in relation to DDF, NFP, MSF, PS, and SOP (Table 4). The 15% CC variances tended to be higher than the complete collection variances for most traits, which indicates that with this sampling intensity, the core collection effectively preserved the complete collection's genetic variability.
The validation of core collections can be carried out using different approaches, such as the analysis of the amplitude coincidence index 30,52, and is based on the analysis of parameters such as mean, variance, and amplitude 29,53,54. For example, when proposing the establishment of a core collection based on the US Department of Agriculture soybean germplasm collection, Oliveira et al. 29 emphasized the analysis of mean, variance, and amplitude observed in core collections as an approach for their validation. In this sense, Frankel 55 highlighted that the sampling strategy is efficient when the core collection retains at least 80% of the original amplitude for a trait. The establishment of a core collection aims to maintain the greatest possible variability from a minimum number of accessions, thus providing greater efficiency in identifying useful genetic diversity by breeders and other scientists. Given this, it is assumed that the 15% CC was effective since it presented means and variances very close to those of the complete collection and a number of accessions considerably lower than the full collection 29,56.

Table 2. Concentration of genotypes in neurons from Kohonen's self-organizing map, as shown in Fig. 4A,B.

According to 27, establishing a core collection provides advantages for both collection curators and breeders. With the proposal of a core collection, two hierarchical levels are established, namely the core collection and the complete collection. From this, the curators can prioritize conservation activities, such as germination and regeneration tests, in the core collections, in addition to concentrating efforts in the characterization and evaluation of the accessions of these collections. For breeders, evaluations of core collections often become less onerous due to the smaller number of accessions in these collections.
With the present study, the agro-morphological characterization of the collection of C. moschata maintained at the BGH-UFV approaches its conclusion 18,44,57. Constituting a substantial sample of the Brazilian germplasm of C. moschata and one of the largest collections of this species in the country 20, the characterization of this collection has covered the evaluation of an extensive set of characteristics, including the analysis of resistance against important phytopathogens of the crop, fruit and seed productivity, as well as chemical-nutritional aspects of fruits, seeds, and seed oil 18,44,57. Previous studies with the collection of C. moschata at the BGH-UFV allowed the identification of promising accessions as sources of genes for genetic improvement of this species.
The implementation of ANNs in the present study proved to be a useful tool to base the establishment of core collections, allowing a clearer distinction of the formed groups compared to the multivariate approach. Implementing ANNs for analyzing the organization of germplasm variability initially brings the advantage of mapping even trends or performances that do not follow linear behaviors 58. Additionally, multivariate approaches bring disadvantages such as their dependence on the experimentation process and the nature of the data set. Therefore, a series of factors related to how the experimentation is conducted can compromise the efficiency of these analyses. For example, different genetic distance indices might be recommended for analyzing the diversity of a set of genotypes, depending on the statistical design in which they were evaluated. The Euclidean distance index, for example, is indicated for cases in which samples under evaluation have not been evaluated with repetition 59, and in this case, the multivariate analysis does not include environmental errors that possibly have influenced the average results of samples. On the other hand, if there was repetition, the Mahalanobis distance is recommended 59, which allows environmental errors to be contemplated in the multivariate analysis. The use of distance measures, such as the Euclidean ones, is restricted to quantitative data and recommended for cases in which there is no correlation between the variables, that is, for cases in which the variables are independent.
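The contrast between the two distance measures can be illustrated for two traits. In the sketch below, the covariance matrix and the trait values are hypothetical (in practice the Mahalanobis distance would use a residual covariance matrix estimated from replicated data), and the helper handles only the two-variable case.

```python
import math


def euclidean(u, v):
    """Ordinary Euclidean distance, which ignores trait correlations."""
    return math.dist(u, v)


def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance for two traits with 2 x 2 covariance `cov`."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    dx, dy = u[0] - v[0], u[1] - v[1]
    # Quadratic form d' * inverse(cov) * d, written out for the 2 x 2 case.
    q = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.sqrt(q)


# Hypothetical covariance: two positively correlated traits.
cov = [[1.0, 0.8], [0.8, 1.0]]
origin = (0.0, 0.0)
along = (1.0, 1.0)     # differs in the direction of the correlation
against = (1.0, -1.0)  # differs against the direction of the correlation

print(euclidean(origin, along), euclidean(origin, against))
print(mahalanobis_2d(origin, along, cov),
      mahalanobis_2d(origin, against, cov))
```

The two genotypes are equally far from the origin in Euclidean terms, but the Mahalanobis distance treats the difference that runs against the correlation between traits as much larger, which is the kind of information the Euclidean index cannot capture when variables are not independent.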
The C. moschata crop presents characteristics that make the evaluation of its germplasm challenging. This species is characterized by branches with vigorous growth and long internodes 32,60, which requires an extensive area for the evaluation of a reduced number of accessions, making the process costly. On the other hand, as already explained, the fruits and seeds of C. moschata express high nutritional value. Its fruits are characterized by a high content of carotenoids such as β- and α-carotene 1,61, components with high provitamin A and antioxidant function 4,46. Moreover, the seed oil of C. moschata consists of approximately 75% UFA and has a high content of MUFA such as oleic acid 7,8, components that are beneficial to human health. The establishment of the core collection proposed in the present study will be crucial to optimize the evaluation and use of promising accessions from this collection, especially for characteristics of high chemical-nutritional importance, such as the carotenoid profile of fruit pulp and the fatty acid profile of seed oil. The core collection could also be used as a source of alleles for genetic improvement programs of C. moschata and other cucurbits.

Conclusion
The accessions of C. moschata expressed a considerable phenotypic range for productivity of fruits, total carotenoid content of fruit pulp, and oleic and linoleic fatty acid contents, which enabled the identification of promising accessions for use as a source of genes for genetic improvement of these traits. Multivariate analyses and the approach using ANNs highlighted the high variability of C. moschata accessions evaluated in this study. The organization of the accessions' variability from ANNs corroborated the variability of accessions observed from the multivariate approach. This demonstrates that the network architecture adopted efficiently organized the genotype variability. ANNs were able to organize the genotypes into closer groups than those obtained from the radial dendrogram grouping, proving to be more efficient in identifying similarity patterns and in organizing the proximity of genotypes between groups. This information was fundamental to supporting the core collections' establishment and validation.
The averages and variances of agro-morphological traits in the 15% CC were those closest to the averages and variances of the complete collection, particularly in relation to DDF, NFP, MSF, PS, and SOP, demonstrating that this core collection was efficient in maintaining the variability of accessions. Establishing the 15% CC will be crucial to optimize the evaluation and use of promising accessions from this collection, especially for traits of high chemical-nutritional importance, such as the carotenoid profile of fruit pulp and the fatty acid profile of seed oil.

Figure 1. Frequency distribution of characteristics associated with fruit production and chemical-nutritional aspects of fruit pulp and seed oil. DDF, Accumulated degree days for flowering; NFP, Number of fruits per plant; PF, Productivity of fruits; TC, Total carotenoid content of fruit pulp; PS, Productivity of seeds; SOP, Seed oil productivity; LAC, Linoleic acid content; and OAC, Oleic acid content.

Figure 2. Grouping of accessions and controls based on a multivariate approach.

Figure 3. Genetic variability of the 91 accessions of C. moschata kept in BGH-UFV from principal components (multivariate approach), showing the dispersion of genotypes in relation to the first two principal components.

Figure 4. Kohonen's self-organizing map demonstrating the concentration and genetic distances of genotypes in neurons. Distribution of genotypes in neurons (A,B) and genetic distance between the genotypes of each neuron (C). In Fig. 4A, the lighter color denotes a greater number of accessions per neuron, while the darker color denotes a smaller number of accessions per neuron. In Fig. 4C, the lighter color denotes a greater distance between the genotypes in the neuron, while the darker color denotes a smaller distance.

Table 3. List of accessions in core collections formed from different sampling intensities.

Table 4. Means and variances of agro-morphological traits in the different core collections. DDF, Accumulated degree days for flowering; NFP, Number of fruits per plant; PF, Productivity of fruits; TC, Total carotenoid content of fruit pulp; MSF, Mass of seeds per fruit; PS, Productivity of seeds; SOP, Seed oil productivity; LAC, Linoleic acid content; and OAC, Oleic acid content.